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Abstract. We present in this paper first-order alternating linearization algorithms based on an alternating direction 
augmented Lagrangian approach for minimizing the sum of two convex functions. Our basic methods require at most 0(l/e) 
iterations to obtain an e-optimal solution, while our accelerated (i.e., fast) versions of them require at most 0(l/\/e) iterations, 
with little change in the computational effort required at each iteration. For both types of methods, we present one algorithm 
that requires both functions to be smooth with Lipschitz continuous gradients and one algorithm that needs only one of the 
functions to be so. Algorithms in this paper are Gauss-Seidel type methods, in contrast to the ones proposed by Goldfarb and 
Ma in [21] where the algorithms are Jacobi type methods. Numerical results are reported to support our theoretical conclusions 
and demonstrate the practical potential of our algorithms. 
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1. Introduction. In this paper, we are interested in the foUowing convex optimization problem: 

(1.1) miuFix) = f{x) + g{x) 

where /, g : K." — > M are both convex functions such that the foUowing two problems are easy to solve for 
any r > and z G M" relative to minimizing F{x): 

(1.2) min I Tf{x) + -\\x ~ z\\^ 



2' 



and 



(1.3) min{.,(.) + l||.-.|P 

In particular, we are specially interested in cases where solving (1.2) (or (1.3)) takes roughly the same effort 
as computing the gradient (or a subgradient) of /(x) (or g{x), respectively). Problems of this type arise in 
many applications of practical interest. The following are some interesting examples. 

Example 1. £i minimization in compressed sensing (CS). Signal recovery problems in compressed 
sensing [8, 13] use the £i norm ||a;||i := X]r=i l-^'l ^ regularization term to enforce sparsity in the solution 
x G M" of a linear system Ax — b, where A G lg'"X" and b G M™. This results in the unconstrained problem 

(1.4) min I ^\\ Ax - b\\l + p\\x\\i 
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where p > 0, which is of the form of (1.1) with f{x) ~ ^\\Ax — b\\2 and g{x) := p\\x\\i. In this case, the two 
problems (1.2) and (1.3) are easy to solve. Specifically, (1.2) reduces to solving a linear system and (1.3) 
reduces to a vector shrinkage operation which requires 0{n) operations (see e.g., [23]). Depending on the 
size and structure of A solving the system of linear equations required by (1.2) may be more expensive, less 
expensive or comparable to computing the gradient {Ax — b) of f{x). In the application we consider in 
Section 4 these computations are comparable due to the special structure of A. 

Example 2. Nuclear norm minimization (NNM). The nuclear norm minimization problem, which 
seeks a low-rank solution of a linear system, can be cast as 

(1.5) min|ip(X)-fe||^ + p||X|u|, 

where r > 0, X S R™^", A : M™^" ^ is a linear operator, b eW and the nuclear norm is defined 

as the sum of the singular values of the matrix X. Problem (1.5) and a special case of it, the so-called matrix 
completion problem, have many applications in optimal control, online recommendation systems, computer 
vision, etc. (see e.g., [7, 9, 27, 42]). In Problem (1.5), if we let f{X) = ^\\AiX) - and g{X) = p\\X\\^, 
then problem (1.2) reduces to solving a linear system. Problem (1.3) has a closed-form solution that is given 
by matrix shrinkage operation (see e.g., [34]). 

Example 3. Robust principal component analysis (RPCA). The RPCA problem seeks to 
recover a low-rank matrix X from a corrupted matrix M . This problem has many applications in computer 
vision, image processing and web data ranking (see e.g., [6]), and can be formulated as 

(1.6) inm{\\X\U + p\\Y\\i : X + Y = M} , 

where p> 0, M e R"><" and the ii norm j|yj|i := J^i j l^ijl- ^^te that (1.6) can be rewritten as 

mm{\\X\\,+p\\AI-X\\i}, 

which is of the form of (1.1). Moreover, the two problems (1.2) and (1.3) corresponding to (1.6) have 
closed-form solutions given respectively by a matrix shrinkage operation and a vector shrinkage operation. 
The matrix shrinkage operation requires a singular value decomposition (SVD) and is comparable in cost to 
computing a subgradient of \\X\\^, or the gradient of the smoothed version of this function (see Section 5). 

Example 4. Sparse inverse covariance selection (SICS). Gaussian graphical models are of great 
interest in statistical learning. Because conditional independence between different nodes correspond to zero 
entries in the inverse covariance matrix of the Gaussian distribution, one can learn the structure of the 
graph by estimating a sparse inverse covariance matrix from sample data by solving the following maximum 
likelihood problem with an ^i-regularization term, (see e.g., [3, 18, 49, 52]). 

max{logdet(X) - (E, X) - i} , 

or equivalently, 

(1.7) min {- logdct(X) -f- (S, X) + p\\X\\i} , 
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where p > and S € S*" (the set of symmetric positive semidefinite matrices) is the sample covariance 
matrix. Note that by defining f{X) := -logdet(X) + (E,X) and g{X) p\\X\\i, (1.7) is of the form 
of (1.1). Moreover, it can be proved that the problem (1.2) has a closed-form solution, which is given by 
a spectral decomposition - a comparable effort to computing the gradient of f{X), while the solution of 
problem (1.3) corresponds to a vector shrinkage operation. 

Algorithms for solving problem (1.1) have been studied extensively in the literature. For large-scale 
problems, for which problems (1.2) and (1.3) are relatively easy to solve, the class of alternating direction 
methods that are based on variable splitting combined with the augmented Lagrangian method are partic- 
ularly important. In these methods, one splits the variable x into two variables, i.e., one introduces a new 
variable y and rewrites Problem (1.1) as 

(1.8) mm{f{x)+giy):x-y^O}. 

Since Problem (1.8) is an equality constrained problem, the augmented Lagrangian method can be used to 
solve it. Given a penalty parameter 1/ fi, at the fc-th iteration, the augmented Lagrangian method minimizes 
the augmented Lagrangian function 

(1-9) C^{x,y;\) -.= f{x) + g{y) - {X, x - y) + ^\\x - yf , 

with respect to x and y, i.e., it solves the subproblem 

(1.10) (x'=^,y'=) :=argmin£^(a;,y;A'=) 

and then updates the Lagrange multiplier A*^ via: 

(1.11) A'^+i :=A'^--(x-^-/). 

A* 

Minimizing Cfj_{x,y; X) with respect to x and y jointly is often not easy. In fact, it certainly is not any 
easier than solving the original problem (1.1). However, if one minimizes C^{x, y; A) with respect to x and 
y alternatingly, one needs to solve problems of the form (1.2) and (1.3), which as we have already discussed, 
is often easy to do. Such an alternating direction augmented Lagrangian method (ADAL) for solving (1.8) 
is given below as Algorithm 1. 



Algorithm 1: Alternating Direction Augmented Lagrangian Method (ADAL) 

1 Choose /X, A" and x^ =2/°. 

2 for /c = 0, 1, • • • do 



yk+l 



argmiuj; C^,{x,y^\ A'') 
arg miuy £^(a;''+\?/; A*"') 



The history of alternating direction methods (AD Ms) goes back to the 1950s for solving PDEs [14, 41] and 
to the 1970s for solving variational problems associated with PDEs [19, 20]. ADMs have also been applied to 
solving variational inequality problems by Tseng [46, 47] and He et al. [24, 26]. Recently, with the emergence 
of compressive sensing and subsequent great interest in £i minimization [8, 13], ADMs have been applied to 
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£i and total variation regularized problems arising from signal processing and image processing. The papers 
of Goldstein and Osher [22], Afonso et al. [2] and Yang and Zhang [51] are based on the alternating direction 
augmented Lagrangian framework (Algorithm 1), and demonstrate that ADMs are very efficient for solving 
£i and TV regularized problems. The work of Yuan [53] and Yuan and Yang [54] showed that ADMs can 
also efficiently solve ^i-regularized problems arising from statistics and data analysis. More recently, Wen, 
Goldfarb and Yin [50] and Malick ct al. [35] applied alternating direction augmented Lagrangian methods 
to solve scmidefinite programming (SDP) problems. The results in [50] show that these methods greatly 
outperform interior point methods on several classes of well-structured SDP problems. Furthermore, He et 
al. proposed an alternating direction based contraction method for solving separable linearly constrained 
convex problems [25]. 

Another important and related class of algorithms for solving (1.1) is based on operator-splitting. The 
aim of these algorithms is to find an x such that 



where Ti and T2 arc maximal monotone operators. This is a more general problem than (1.1) and ADMs 
for it have been the focus of a substantial amount of research; e.g., see [10, 11, 12, 15, 16, 32, 44]. Since the 
first-order optimality conditions for (1.1) are: 



where df{x) denotes the subdifferential of f{x) at the point x, a solution to Problem (1.1) can be obtained 
by solving Problem (1.12). For example, sec [15, 32, 44] and references therein for more information on this 
class of algorithms. 

While global convergence results for various splitting and alternating direction algorithms have been 
established under appropriate conditions, our interest here is on iteration complexity bounds for such algo- 
rithms. By an iteration complexity bound we mean a bound on the number of iterations needed to obtain 
an e-optimal solution which is defined as follows. 

Definition 1.1. G K" is called an e-optimal solution to (1.1) if F{xe) ~ F{x*) < e, where x* is an 
optimal solution to (1.1). 

Complexity bounds for first-order methods for solving convex optimization problems have been given by 
Nesterov and many others. In [37, 38], Nesterov gave first-order algorithms for solving smooth unconstrained 
convex minimization problems with an iteration complexity of 0{y^L/e), where L is the Lipschitz constant 
of the gradient of the objective function, and showed that this is the best complexity that is obtainable 
when only first-order information is used. These methods can be viewed as accelerated gradient methods 
where a combination of past iterates are used to compute the next iterate. Similar techniques were then 
applied to nonsmooth problems [4, 39, 40, 48] and corresponding optimal complexity results were obtained. 
The ISTA (Iterative Shrinkage/Thresholding Algorithm) and FISTA (Fast Iterative Shrinkage/Thresholding 
Algorithm) algorithms proposed by Beck and Tcboulle in [4] are designed for solving (1.1) when one of 
the functions (say f{x)) is smooth and the other is not. It is proved in [4] that the number of iterations 
required by ISTA and FISTA to get an e-optimal solution to problem (1.1) are respectively 0{L{f)/e) and 
0{y^L{f)/e), under the assumption that V/(x) is Lipschitz continuous with Lipschitz constant L{f), i.e.. 



(1.12) 



OeTi{x)+T2{x), 



(1.13) 



oedf{x) + dg{x), 



II V/(x) - V/(y)||2 < L{f)\\x - y\\2, Vx, y & R". 
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ISTA computes a sequence {cc'^} via the iteration 



(1.14) x''^^ := a,TgmmQf{x,x''), 
where 

(1.15) Qf{u,v) giu) + f{v) + (V/(z;), u - v) + ^^\\u - v\\\ 
while FISTA computes {x^} via the iteration 

arg YmiVxQf{x,y^) 

(i + yrriif) /2 

starting with ti = I, = x^ e M" and fc = 1. 

Note that ISTA and FISTA treat the functions f{x) and g{x) very differently. At each iteration they 
both hnearize the function f{x) but never directly minimize it, while they do minimize the function g{x) in 
conjunction with the linearization of f{x) and a proximal (penalty) term. These two methods have proved 
to be efficient for solving the CS problem (1.4) (see e.g., [4, 23]) and the NNM problem (1.5) (see e.g., 
[34, 45]). ISTA and FISTA work well in these areas because f{x) is quadratic and is well approximated by 
linearization. However, for the RPCA problem (1.6) where two complicated functions arc involved, ISTA 
and FISTA do not work well. As we shall show in Sections 2 and 3, our ADMs are very effective in solving 
RPCA problems. For the SICS problem (1.7), intermediate iterates may not be positive definite, and 
hence the gradient of f{X) = — logdet(Ar) + (S, AT) may not be well defined at X'' . Therefore, ISTA and 
FISTA cannot be used to solve the SICS problem (1.7). In [43], it is shown that SICS problems can be very 
efficiently solved by our ADM approach. 

Our contribution. In this paper, we propose both basic and accelerated (i.e., fast) versions of first- 
order alternating linearization methods (ALMs) based on an alternating direction augmented Lagrangian 
approach for solving (1.1) and analyze their iteration complexities. Our basic methods require at most 0{L/e) 
iterations to obtain an e-optimal solution, while our fast methods require at most 0{yj L/e) iterations with 
only a very small increase in the computational effort required at each iteration. Thus, our fast methods 
arc optimal first-order methods in terms of iteration complexity. For both types of methods, we present 
an algorithm that requires both functions to be continuously differentiable with Lipschitz constants for the 
gradients denoted by £(/) and L{g). In this case L = max{L(/), L(g)}. We also present for each type of 
method, an algorithm that only needs one of the functions, say /(x), to be smooth, in which case L ~ L{f). 
These algorithms are related to the multiple splitting algorithms in a recent paper by Goldfarb and Ma 
[21]. The algorithms in [21] are Jacobi type methods since they do not use information from the current 
iteration to solve succeeding subproblems in that iteration, while the algorithms proposed in this paper are 
Gauss-Seidcl type methods since information from the current iteration is used later in the same iteration. 
These algorithms can also be viewed as extensions of the ISTA and FISTA algorithms in [4] . The complexity 
bounds wc obtain for our algorithms are similar to (and as much as a factor of two better that) those in [4]. 

At each iteration, our algorithms alternatively minimize two different approximations to the original 
objective function, obtained by keeping one function unchanged and linearizing the other one. Our basic 
algorithm is similar in many ways to the alternating linearization method proposed by Kiwiel et al. [28] . In 
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(1.16) 



X 



particular, the approximate functions minimized at each step of Algorithm 3.1 in [28] have the same form as 
those minimized in our algorithm. However, our basic algorithm differs from the one in [28] in the way that 
the proximal terms are chosen, and our accelerated algorithms are very different. Moreover, no complexity 
bounds have been given for the algorithm in [28]. To the best of our knowledge, the complexity results in 
this paper are the first ones that have been given for a Gauss-Seidel type alternating direction method ^. 
Complexity results for related Jacobi type alternating direction methods are given in [21]. 

Organization. The rest of this paper is organized as follows. In Sections 2 and 3 we propose our 
alternating linearization methods based on alternating direction augmented Lagrangian methods and give 
convergence/complexity bounds for them. We compare the performance of our ALMs to other competing 
first-order algorithms using an image deblurring problem in Section 4. In Section 5, we apply our ALMs 
to solve very large RPCA problems arising from background extraction in surveillance video and matrix 
completion and report the numerical results. Finally, we make some conclusion in Section 6. 

2. Alternating Linearization Methods. In iteration of the ADAL method. Algorithm 1, the La- 
grange multiplier A is updated just once, immediately after the augmented Lagrangian is minimized with 
respect to y. Since the alternating direction approach is meant to be symmetric with respect to x and ?/, it is 
natural to also update A after solving the subproblem with respect to x. By doing this, we get a symmetric 
version of the ADAL method. This algorithm is given below as Algorithm 2. 

Algorithm 2: Symmetric Alternating Direction Augmented Lagrangian Method (SADAL) 

1 Choose /z. A" and ~ y'^ . 

2 for /c = 0, 1, • • • do 



y^+^ ;= argminj^£^(a;''+\2/; A''^^) 



This ADAL variant is described and analyzed in [20]. Moreover, it is shown in [20] that Algorithms 1 and 
2 are equivalent to the Douglas-Rachford [14] and Peaceman-Rachford [41] methods, respectively applied to 
the optimality condition (1.13) for problem (1.1). If we assume that both f{x) and g{x) are differentiable, 
it follows from the first order optimality conditions for the two subproblems in lines 3 and 5 of Algorithm 2 
that 

(2.1) A'^'+5 = Vf{x''+^) and \^+^ = ~Vg{y^+^). 

Substituting (2.1) into Algorithm 2, we get the following alternating linearization method (ALM) which is 
equivalent to the SADAL method (Algorithm 2) when both / and g are differentiable. In Algorithm 3, 
Qf{u,v) is defined by (1.15) and 

(2.2) Qg{u,v) := f{u) + g{v) + {\/g{v),u^ y) + l.\\u - v\\l 

In Algorithm 3, we alternatively replace the functions g and / by their linearizations plus a proximal 

^ After completion of an earlier version of the present paper, which is available on http;/ /arxiv.org/abs/0912.4571, Monteiro 
and Svaiter [36] gave an iteration complexity bound to achieve a desired closeness of the current iterate to the solution for 
ADMs for solving the more general problem (1.12). 
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Algorithm 3: Alternating Linearization Method (ALM) 



1 Choose fj, and ~ 

2 for A; = 0, 1, • • • do 

k+l 



y^+^ — argminy (3/(2/, x'^+i] 



regularization term to get an approximation to the original function F. Thus, our ALM algorithm can also 
be viewed as a proximal point algorithm. 

A drawback of Algorithm 3 is that it requires both / and g to be continuously differentiable. In many 
applications, however, one of these functions is nonsmooth, as in the examples given in Section 1. Although 
Algorithm 2 can be applied when f{x) and g{x) are nonsmooth, we are unable to provide a comparable 
complexity bound in this case. However, when only one of the functions of / and g is nonsmooth (say g 
is nonsmooth), the following variant of Algorithm 3 applies, and for this algorithm, we have a complexity 
result. 



Algorithm 4: Alternating Linearization Method with Skipping Steps (ALM-S) 



1 Choose fi, A" and = y". 

2 for /c = 0, 1, • • • do 



liF{x^+^) > C^{x^+^,y^\\^), then x^+^ :== / 



argmin^: Cf,{x,y^\ A*") 



yk+l 



arg miuj^ Qfiv^x^^'^] 
■ V/(a;'^-+i) - (x'=+i - 



We call Algorithm 4, ALM with skipping steps (ALM-S) because in line 4 of Algorithm 4. if 



(2.3) 



F{x'+')> C^{x'+',y'-X') 



holds, we let x^'^^ := y^ . i.e., we skip the computation of a;'^'+^ in line 3. An alternative version of Algorithm 
4 that has smaller average work per iteration is the following Algorithm 5. 



Algorithm 5: Alternating Linearization Method with Skipping Steps (equivalent version) 



1 Choose /z, A° and =2/°. 

2 for A: = 0, 1, • • • do 



3 
4 
5 
6 
7 
8 

10 
11 



2;''+^ := argmin^; C^{x,y''; A'') 

if F{x''+^) > £^(a;'=+\/;A'=) then 



yk+i 
^k+i 



= y 

= argmiuj^ Qf{y,x''+^) 

= V/(a;'^-+i) - (x'^+i - y''+^)/fJ. 



else 



A''^+^ 



A^ 



ix 



y'')/^^ 



arg mm 



in,£^(.T^-+i,2/;A^-+i) 



A^+i := A^-+^ - (x^+i -j/^+i)//i 



Note that in Algorithm 5, when (2.3) does not hold, we switch to the SADAL algorithm, which updates 
A instead of computing V/(a;'^"*"^). Algorithm 5 is usually faster than Algorithm 4 when S/f{x'^'^^) is costly 
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to compute in addition to performing Step 3. Note also that when (2.3) holds, then the steps of the algorithm 
reduce to those of ISTA. 

The following theorem gives conditions under which Algorithms 2, 3, 4 and 5 are equivalent. 

Theorem 2.1. (i) If both f and g are dijferentiable, and Xq is set to — V.g(y*'), then Algorithms 2 and 
3 are equivalent, (ii) If in addition g is Lipschitz continuous with Lipschitz constant L{g), and < 1/ L{g), 
then Algorithms 3 and ^ are equivalent. (Hi) If f is differentiable, then Algorithms 4 o-i^-d 5 are equivalent. 

Proof. When both / and g are differentiable and Aq = —Wg{y'^), (2.1) holds for all fc > 0, and it 
follows that Cf,{x,y^]\^) = Qg{x,y^) and y; A'^'+s ) ee Qf{y^x^^'^). This proves part (i). If V5(a;) 

is Lipschitz continuous and ^ < l/L{g), 

9{x''+') < g{y') + {Vg{y^),x^+' - y") + i- \\x^+' - y% , 

holds (see e.g., [5]). This implies that (2.3) does not hold and hence, x'^'^^ := argmin^; C^{x,y''; X'^), and the 
equivalence of Algorithms 3 and 4 follows. This proves part (ii). The optimality of in line 3 of Algorithm 
5 implies that A''+3 = V/(a;'^+-^) when (2.3) does not hold and hence that C^{x^^^ ,y] A'^+s) = Qf{y^ a;*''+^). 
This proves part (iii). □ 

We show in the following that the iteration complexity of Algorithm 4 is 0(1 /e) for obtaining an e-optimal 
solution for (1.1). First, we need the following generalization of Lemma 2.3 in [4]. 

Lemma 2.2. Let ip : M" -+ M and (j> : R" M &e convex functions and define 
Q^{u,v) -.^ (t){u) + ip{v) + {j^{v),u - v) + ^ll^^'^ll2> 

and 



(2.4) pj,{v) := a.TgmmQ^{u,v), 

u 

where 7,/,(u) is any subgradient in the subdifferential d'ip{v) of 'ip{v) at the point v. Let $(•) = 0(-) +-!/;(•). 
For any v, if 

(2.5) ^{p^{v))<Q^{p^{v),v), 
then for any u, 

(2.6) 2/i(<i>(u) - Hp.^iv))) > \\p^,{v) -u\\^-\\v- uf. 



Proof. From (2.5), we have 

$(u) - $(pv>(w)) > ^u)~Q^,{p^,{v),v) 

= ^u)- (^^{p4v))+i:{v) + {j^{v),p^,{v)-v) + ^\\p4v)-v\\iy 

Since (f> and ip are convex we have 



(2.8) 



<j){u) > (f>{p4,{v)) + {u-p^,{v),j^{p.4,{v))), 
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and 



(2.9) ij{u)>ij{v) + {u-v,j4v)), 

where 70(-) is a subgradient of </>(•) and Jtj,{pip{v)) satisfies the first-order optimahty conditions for (2.4), i.e., 

(2.10) l^{Pi'{v)) + J^{v) + -{p^{v) -v) = 0. 
Summing (2.8) and (2.9) yields 

(2.11) $(u) > <f>{p^{v)) + {u - p^{v),j^{p^,{v))} + i'{v) + {u-v,j^{v)). 
Therefore, from (2.7), (2.10) and (2.11) it follows that 

$(w)-$(p^(w)) > {Ji,{v) + j^{p^{v)),u - p^,{v)) - ^\\P4,{v) - v\\l 

(2.12) = i-hpA'") - v),u~p^{v)) - l^\\p.^{v) - v\\l 

= ^{\\pdv)-uf-\\v-uf). D 



Theorem 2.3. Assume V/(.) is Lipschitz continuous with Lipschitz constant L(f). For fj, < 1/L(f), 
the iterates in Algorithm 4 satisfy 

(2.13) f,,>)_f,,.)<fc!_fl!, 

where x* is an optimal solution of (1.1) and kn is the number of iterations until the k-th for which i^(a;*''+^) < 
£^(2:'^'*'^, y*"'; A*^), i.e., the number of iterations when no skipping step occurs. Thus, the sequence {F{y'')} 
produced by Algorithm 4 converges to F{x*). Moreover, if \/{j3L{f)) < /i < 1/L{f) where /? > 1, the number 
of iterations needed to obtain an e-optimal solution is at most \C/e~\, where C ~ PL{f)\\x^ — x*|p/2. 

Proof. Let / be the set of all iteration indices until k — 1-st for which no skipping occurs and let 1^ be 
its complement. Let / — {n^}, i = 0, . . . , fc„ — 1. It follows that for all n G /c, = y". 

For 71 G / we can apply Lemma 2.2 to obtain the following inequalities. In (2.6), by letting ip ~ f, cj) ~ g, 

u ~ X* and v ~ a;"+^, we get p^{v) = y""^^, $ = F and 

(2.14) 2^,{F{x*) - F{y-+')) > -x*f- \\x'^+' - x* f. 

Similarly, by letting ip = g,(f) = f,u = x* and w = in (2.6) we get Pg{v) = a;""*"^, ^ = F and 

(2.15) 2^l{F{x*) - F(x"+i)) > - x*f - Ijy" - x*f. 
Taking the summation of (2.14) and (2.15) we get 

(2.16) 2tii2F{x*) - F(x"+i) - F(y"+i)) > - x*f - - x*f. 



For n e Ic, (2.14) holds. Then since x"^'^^ = y" we get 

(2.17) 2/x(i^(x*) - W+i)) > -x*f-\\y^- x*f. 

Summing (2.16) and (2.17) over n = 0, 1, . . . , fc — 1 we get 

fc-i 

(2.18) 2m((2|/| + |/e|)F(x*) - ^i^(x"+i) - J2 Fiy""-')) 

n£l n=0 

>Y.{\\y-+'-x*r-\\v--x*\\') 

n=0 

II k *||2 II *||2 

= \\y -X. \\ -\\y - X \\ 
>~\\x°-x*f. 

For any n, since Lemma 2.2 holds for any u, letting u = instead of x* we get from (2.14) that 

(2.19) 2/i(F(a;"+i) - i^(y"+i)) > - x'^+'f > 0, 
or, equivalently, 

(2.20) 2^^iFix-) - F(y")) > - > 0. 

Thus we get i^(2/") < F(a;"),Vn. 

Similarly, for n e / by letting u ~ y" instead of x* we get from (2.15) that 

(2.21) 2/.(F(y") - F(x"+i)) > - y"f > 0. 

On the other hand, for n S Jc, (2.21) holds trivially because x""''^ = y"; thus (2.21) holds for all n. 
Adding (2.19) and (2.21) and adding (2.20) and (2.21), respectively, yield 

(2.22) 2n{F{y") - F{y"+^)) > and 2n{F{x") - F{x"+^)) > 0, for all n. 

The inequalities (2.22) show that the sequences of function values F{y") and F(a;") are non-increasing. Thus 
we have, 

fc-i 

(2.23) ^F{y''+^)>kF{y'') and ^ F(a;"+i) > fc„F(a;'=). 

n=0 rte/ 

Combining (2.18) and (2.23) yields 

(2.24) 2fi ((fc + K)Fix*) - k„F{x'') - kF{y'')) > - x*\\\ 
Hence, since F{y^) < F{x^), 

2/x(fc + fc„) (i^(/) - F{x*)) < \\x" - x*W\ 
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which gives us the desired result (2.13). □ 

Corollary 2.4. Assume V/ and V g are both Lipschitz continuous with Lipschitz constants L{f) and 
L{g), respectively. For fi < min{l/L(/), 1/L{g)}, Algorithm 3 satisfies 

(2.25) F{y'')-F{x*)< ""'^'^f Vfc, 

where x* is an optimal solution of (1.1). Thus sequence {F{y^y\ produced by Algorithm 3 converges to F{x*). 
Moreover, if 1/ {/3 'niax{L{f), L{g)}) < fi < 1/ max{L{f), L{g)} where /? > 1, the number of iterations needed 
to get an e-optimal solution is at most [C/e] , where C = l3m.eix{L(f), L(g)}\\x'^ — x*\\'^/4. 
Proof. The conclusion follows from Theorems 2.1 and 2.3 and kn = k. □ 

Remark 2.5. The complexity bound in Corollary 2.4 is smaller than the analogous bound for ISTA 
in [4] by a factor of two. It is easy to see that the bound in Theorem 2.3 is also an improvement over the 
bound in [4] as long as F{x'^'^^) < C^lx^'^^ ,y^\\^) holds for at least one value of k. It is reasonable then 
to ask if the per-iteration cost of Algorithms 3 and 4 are comparable to that of ISTA. It is indeed the case 
when the assumption holds that minimizing Ci^i,{x^y^;\^) has comparable cost (and often involves the same 
computations) as computing the gradient V/(y'^). 

Remark 2.6. More general problems of the form 

min f{x) + g{y) 
s.t. Ax + y = b 

are easily handled by our approach, since one can express (2.26) as 

min f{x)+g{b — Ax). 



(2.26) 



Remark 2.7. // a convex constraint x & C, where C is a convex set is added to problem (1.1), and we 
impose this constraint in the two subproblems in Algorithms 3 and 4, i-e., we impose x Cz C in the subproblems 
with respect to x and y (z C in the subproblems with respect to y, the complexity results in Theorem 2.3 and 
Corollary 2.4 continue to hold. The only changes in the proof are in Lemma 2.2. If there is a constraint 
X £ C, then (2.6) holds for any u € C and v Cz C. Also in the proof of Lemma 2.2, the first equality in (2.12) 
becomes a ">" inequality due to the fact that the optimality conditions (2.10) become 

{l4,{Pip{v)) +l^{v) + -{p^{v) - v),u- p^,{v)) > Q,\/u E C. 



Remark 2.8. Although Algorithms 3 and 4 assume that the Lipschitz constants are known, and hence 
that an upper bound for fi is known, this can be relaxed by using the backtracking technique in [4] to estimate 
fi at each iteration. 

3. Fast Alternating Linearization Methods. In this section, we propose a fast alternating lin- 
earization method (FALM) which computes an e-optimal solution to problem (1.1) in 0{y^L/e) iterations, 
while keeping the work at each iteration almost the same as that required by ALM. 

FALM is an accelerated version of ALM for solving (1.1), or equivalently (1.8), when f{x) and g{x) are 
both differentiable, and is given below as Algorithm 6. Clearly, FALM is also a Gauss-Seidel type algorithm. 
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In fact, it is a successive over-relaxation type algorithm since (tk — l)/tA:+i > 0,Vfc > 2. 

Algorithm 6: Fast Alternating Linearization Method (FALM) 

1 Choose /i and — — z^, set ti = 1. 

2 for k ~ 1,2, ■ ■ ■ do 



X argmin^; Qg{x,z ) 
:= argminj^(5/(y,a;'') 
tk+i := (1 + yr+4t2)/2 



Algorithm 6 requires both / and g to be continuously differentiable. To develop an algorithm that can 
be applied to problems where one of the functions is non-differentiable, we use a skipping technique as in 
Algorithm 4. FALM with skipping steps (FALM-S), which does not require g{x) to be smooth, is given 
below as Algorithm 7. 



Algorithm 7: FALM with Skipping Steps (FALM-S) 



1 Choose x'^ — y'^ — and e —dg{z^), set ti = 1. 

2 for A: = 1, 2, • • • do 
x^ := argmiuj. C^{x, z^; X^) 
if F{x'') > A'^) then 

if x-step was not skipped at iteration k-1 then 

tk ■■= (l 
else 
tk : 



l + 8fti /2 



l + Jl + AtlA/2 



^k 



.k-1 



X := z 
k — „,,„ n Jr^k 



V 



k-2\ 



y- := argmiuj^ Qf{x'',y) 
if x'^ ~ z'^ then 



tk+i ■■= 1 



else 



v/r+2ifj/2 
tk+i := (l + yiTlil) /2 



yk+l 



tk-1 



,k-l\ 



Choose A^-'+i e ~dg{z''+^) 



The following theorem gives conditions under which Algorithms 6 and 7 are equivalent. 

Theorem 3.1. // both f{x) and g(x) are differentiable and \7g(x) is Lipschitz continuous with Lipschitz 
constant L{g), and fi < 1/L{g), then Algorithms 6 and 7 are equivalent. 

Proof. As in Theorem 2.1, if / and g are differentiable, C^{x, z^; \^) = Qg{x, z''). The conclusion then 
follows from the fact that F{x^) < C^i{x'^ , z^; X'') always holds when /i < 1/L{g) and thus there are no 
skipping steps. □ 

To prove that Algorithm 7 requires 0{^J L(f) / e) iterations to obtain an e-optimal solution, we need the 
following lemmas. We call fc-th iteration a skipping step if x^ = z^ , and a regular step if x^ ^ z^ . 
Lemma 3.2. The sequence {x^,y^} generated by Algorithm 7 satisfies 

(3.1) 2^x{tlvk ~ tl+,Vk+i) > Wu'^+Y - 
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where := tk.y^ — {tk — ^)y^ ^ ^ x* and Vk ■= 2F{y^) — 2F{x*) if iteration k is a regular step and 
Vk ■= F{y^) — F{x*) if iteration k is a skipping step. 

Proof. There are four cases to consider: (i) both the fc-th and the (fc + l)-st iterations are regular steps; 
(ii) the fc-th iteration is a regular step and the (fc + l)-st iteration is a skipping step; (iii) both the fc-th 
and the (fc -I- l)-st iterations are skipping steps; (iv) the fc-th iteration is a skipping step and the (fc -I- l)-st 
iteration is a regular step. We will prove that the following inequality holds for all the four cases: 



{6.2) 



>tk+^{tk+, 1) (11/^+^ - y'r \\z'^+' /IP) + tfe+i (11/+^ - - "-'^■+^ 



The proof of (3.1) and hence, the lemma, then follows from the fact that the right hand side of inequality 

(3.2) equals 

Pfe+l/+l - {tk+l - 1)/ - x*f - \\tk+lZ^+' - (tk+l - 1)/ -x*f = \\u''+Y - Wu'^f, 

where we have used the fact that tk+iz''^^ := tk+iy^ + tk{y^ — y^^^) — {y^ — y^^^)- 

Case (i): Let us first consider the case when both the fc-th and (fc + l)-st iterations are regular steps. In 
(2.6), by letting -0 = /, = g, u = y^ and v = x^^^ , wc get p,f,{v) = y''^^ , $ = F and 

(3.3) 2/x(i^(/) - F(/+i)) > - - - /f. 

In (2.6), by letting ip = g, (j> = f , u = y^ , v = z'"'+^, we get p,p{v) = x''^^ , ^ = F and 

(3.4) 2fi{F{y') ~ F{x''+^)) > \\x'+^ - /f - \\z'+^ - /f. 
Summing (3.3) and (3.4), and using the fact that F{y^+^) < F{x''+'^), wc obtain, 

(3.5) 2^,{vk - Vk+i) = 2M(2i^(/) - 2i^(/+i)) > ||/+i - - \\z''+^ /f 
Again, in (2.6), by letting ip~g,(p = f,u = x*,v^ z^^^ , wc get p^{v) ~ a;'^+^, ^ — F and 

(3.6) 2^i{F{x*) - F{x''+^)) > llx'^+i - x*\\^ - \\z''+^ - x*\\^. 

In (2.6), by letting ip = f , <p = g, u ^ x* , v = x'^^^ , we get p^{v) = y*^^"^, $ = F and 

(3.7) 2^l{F{x*) - F(/+i) > -x*f- - x*f. 
Summing (3.6) and (3.7), and again using the fact that F(/+i) < F{x''+'^), we obtain, 

(3.8) -2iivk+i = 2fi{2Fix*) - 2F(/+i)) > \\y'^+' -x*f- \\z''+' - x*\\\ 

If we multiply (3.5) by t^, and (3.8) by tk+i, and take the sum of the resulting two inequalities, we get (3.2) 
by using the fact that t| = tk+i{tk+i — 1). 

Case (ii): By letting ilj = f,4> = g,u = y'' and v = z^'^^ in (2.6), we get p^{v) = y^^^ , $ = F and 

(3.9) 2/i(F(/) - F(/+i)) > - /IP - ||z^-+i - /Ip. 
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Since the steps taken in the k-th and (fc + l)-st iterations are regular and skipping, respectively, we have 

(3.10) 2/i - = 2/.(F(/) - > - /IP - - z/lp 
Also by letting ip = f , (f) = g, u = x* and v = z''^^ in (2.6), we get p^{v) = y^^^ , $ = F and 

(3.11) ~2^iVk+l = 2p.{F{x*) - Fiy^+')) > -x*\\'- \\z'^+' - x*\\'. 

Then multiplying (3.10) by 2^^, (3.11) by tfc+i, summing the resulting two inequalities and using the fact 
that in this case 2t^ = tk+i{tk+i — 1), we obtain (3.2). 

Case (iii): This case reduces to two consecutive FISTA steps and the proof above applies with = 
tk+i{tk+i — 1) and inequality (3.10) replaced by 

(3.12) 2^,{v, ^ vk+i) = 2m(F(/) - F(y^+i)) > - y'^f - \\z'^+' 

which gets multiplied by tj,. 

Case (iv): In this case, (3.5) in the proof of case (i) is replaced by 

(3.13) 2^i{2vk - «fe+i) - 2Ai(2F(/) - 2F{y^+')) > \\y^+' -y'f- \\z'+' -y^f 

which when multiplied by tf./2 and combined with (3.8) multiplied by tk+i, and the fact that in this case 
tl/2 = tk+iitk+i - 1), yields (3.2). □ 

The following lemma gives lower bounds for the sequence of scalars {tk} generated by Algorithm 7. 

Lemma 3.3. For all k > 1 the sequence {tk} generated by Algorithm 7 satisfies: 
if the first step is a skipping step, 



■^{k + 1 + ar{k)) if k is a skipping step, 
ji=(fc + 1 + ar{k)) if k is a regular step, 



tk > j 

if the first step is a regular step, 

{.^(fc + 1 + ds(fc)) if k is a skipping step, 
i(fc + 1 + ds(fc)) if k is a regular step, 

where r{k) and s{k) are the number of steps among the first k steps that are regular and skipping steps, 
respectively, and a = \/2 — 1 and a = ^ ^■ 

Proof. Consider the case where the first iteration of Algorithm 7 is a skipping step. Clearly, the sequence 
of iterations follows a pattern of alternating blocks of one or more skipping steps and one or more regular 
steps. Let the index of the first iteration in the i-th block be denoted by n.^. Since it is assumed that the first 
iteration is a skipping step, iterations rii, 713, 715 . . . are skipping steps (ni = 1) and ri2, 71.4, rig .. . are regular 
steps. Note that the statement of the lemma in this case corresponds to 



(3.14) tk > 



^(fc + 1 + ar{k)), for uj < k < nj-\-i — 1, if j is odd, 
^^(fc + 1 + ar{k)), for uj < k < rij+i — 1, if j is even. 



which we will prove by induction on j . 
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We first note that it follows from the updating rules and formulas for tk that 



{i + \/2^fe-i, if fc is a skipping step and fc — 1 is a regular step, 
i + -^tk-i, if /c is a regular step and fc — 1 is a skipping step, 
otherwise. 

Consider j ^ I. Clearly (3.14) holds for all iterations ni = l</c<n2 — 1, since tk > holds trivially for 
ti = 1, and for 1< fc < 712 - 1, tfc > + = 

Now assume that (3.14) holds for all j < j. If j is even, iterations fc = nj and fc — 1 are, respectively, 
regular and skipping iterations. Hence, from (3.15), we have that 

t.>\ + ^t.-i > ^ + ^(fc + -rik - D) - ^(fc + 1 + -(fc)). 

Since the remaining p = nj^i — 1 — rij iterations before iteration tt-j+i are all regular iterations {p may be 
zero), we have from (3.15) that, for rij < k < jt-j+i — 1, 

tk > ^ + > ^ + ^{nj + 1 + arinj)) 

If j is odd, iteration fc = is a skipping iteration. Hence, from (3.15), we have that tfe > ^ + \/2tk-i > 
^ + ^/2^^{k + ar{k—l)) = i(fc + l + ar(fc)). Since the remaining^ = 7ij_|_]^ — 1 — iterations before iteration 
rij^i arc all skipping iterations (again p may be zero), we have from (3.15) that, for nj < fc < Tij+i — 1, 
tk > ^ 2^ + tn-. > — -j-^ + ^{nj + 1 + ar(nj)) = i(fc + 1 + ar(fc)). This concludes the induction. 

Since the proof for the case that the first step is a regular step is totally analogous to the above proof, 
we leave this to the reader. □ 

Now we are ready to give the complexity of Algorithm 7. 

Theorem 3.4. Let a = \/2 — 1 and r{k) be the number of steps among the first fc steps that are regular 
steps. Assuming V/(-) is Lipschitz continuous with Lipschitz constant L{f), if fi < 1/L{f), the sequence 
{y^} generated by Algorithm 7 satisfies: 



(3.16) i^(/)-F(x*)< 



//(fc + l + af(fc))2' 

where f{k) — r{k) if the first step is a skipping step, and f{k) = r(fc) + I if the first step is a regular step. 

Hence, the sequence {F{y^)} produced by Algorithm 4 converges to F{x*). Moreover, if 1/ {(3L{f)) < 
fi < 1/ L{f) where /? > 1, the number of iterations required by Algorithm 1 to get an e-optimal solution to 
(1.1) IS at most L^CTeJ, where C 2/3L{f)\\x" - x*\\^. 

Proof. Using the same notation as in Lemmas 3.2 and 3.3, (3.8) and (3.11) imply that 

^2^m>\\y'-x*\\'-\\z'-x*\\' 

holds whether the first iteration is a skipping step or not. Thus we have 

(3.17) 2^iVl + -x*f <\\z^ -x*f ^ - x*f. 
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From Lemma 3.2 we know that the sequence {2iJ,t^Vk + IW^W^} is non-increasmg. Therefore, we have 



(3.18) 2fitlvk < 2fitlvk + < 2fitlvi + Wu^W^ = 2idvi + \\y^ - x*\\^ < \\x° - x*\\^, 

where the equahty follows from the facts that <i = 1 and = i/ — x* , and the last inequality is from (3.17). 

Recall the definition of Vk in Lemma 3.2 and the bounds of tk in Lemma 3.3. We get (i) and (ii) from 
(3.18). Keeping in mind that Vk has a different expression depending on whether the fc-th step is a skipping 
or a regular step, it follows that the sequence {y''} generated by Algorithm 7 satisfies: 

(i) if the first step is a skipping step, then 

F{y')-F{x*)< 211-"- 



fi{k + 1 + ar{k)y 
(ii) if the first step is a regular step, then 

F{y'')-Fix*)< il^'"-^*"' 



/i(fc + 1 + ds(fc))2 ■ 

It is easy to check that these bounds are equivalent to (3.16), and that the worst case bound on the 
number of iterations follows from (3.16). □ 

Corollary 3.5. Assume V/ and \Jg are both Lipschitz continuous with Lipschitz constants L{f) and 
L{g), respectively. For fi < min{l/L(/), 1/L{g)}, Algorithm 6 satisfies 

(3.19) 

where x* is an optimal solution of (1.1). Hence, the sequence {F{y^y\ produced by Algorithm 6 converges 
to F{x*), and i/ l/(/3max{i(/),i(g)}) < M < 1/ niax{iy(/), L(g)}, where /3 > 1, the number of iterations 
needed to get an e-optimal solution is at most \yjcje — 1] , where C ~ /3 max{L(/), L{g)}\\x^ — x*\\'^ . 

Proof. Note that since fi < min{l/L(/), l/L{g)}, from Theorem 3.1 we know that Algorithms 6 and 7 
are equivalent. That is, every step in Algorithm 7 is a regular step. Therefore, case (ii) in Theorem 3.4 holds 
and s{k) = 0, which leads to (3.19). □ 

Remark 3.6. The complexity bound in Corollary 3.5 is smaller than the analogous bound for FIST A in 
[4] by a factor ofV2. It is easy to see that the bound in Theorem 3.4 is also an improvement over the bound 
in [4] as long as F{x''^^) < Cfj_{x^^^ ,y^]\^) holds for at least one value ofk. As in the case of Algorithms 
3 and 4, the per-iteration cost of Algorithms 6 and 7 are comparable to that of FISTA. 

Remark 3.7. Line 6 in Algorithm 6 and Line 15 in Algorithm 7 can be changed to: 

^fc+i := yj^ + -^[tkiy'' - w'-i) - iw'^ - w''-% 
tk+i 

where := ax^ + (1 — a)y^ ,a G (0, 1), and Theorem 3.4 and Corollary 3.5 still hold. 

Remark 3.8. Although Algorithms 6 and 7, assume that the Lipschitz constants are known, and hence 
that an upper bound for fi is known, this can be relaxed by using the backtracking technique in [4] to estimate 
fi at each iteration. 

4. Comparison of ALM, FALM, ISTA, FISTA, SADAL and SALSA. In this section we com- 
pare the performance of our basic and fast ALMs, with and without skipping steps, against ISTA, FISTA, 
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SADAL (Algorithm 2) and an alternating direction augmented Lagrangian method SALSA described in [2] 
on a benchmark wavelet-based image deblurring problem from [17]. In this problem, the original image is 
the well-known Cameraman image of size 256 x 256 and the observed image is obtained after imposing a 
uniform blur of size 9x9 (denoted by the operator R) and Gaussian noise (generated by the function randn 
in MATLAB with a seed of and a standard deviation of 0.56). Since the coefficient of the wavelet transform 
of the image is sparse in this problem, one can try to reconstruct the image u from the observed image h by 
solving the problem: 

(4.1) x:=argmin ^Ma; — &II2 + PlI^^Hi, 

X 2 

and setting u := Wx, where A := RW and W is the inverse discrete Haar wavelet transform with four 
levels. By defining f{x) := ^\\Ax — and g{x) :— p\\x\\i, it is clear that (4.1) can be expressed in the 
form of (1.1) and can be solved by ALM-S (Algorithm 4), FALM-S (Algorithm 7), ISTA, FISTA, SALSA 
and SADAL (Algorithm 2). However, in order to use ALM (Algorithm 3) and FALM (Algorithm 6), we 
need to smooth g{x) first, since these two algorithms require both / and g to be smooth. Here we apply 
the smoothing technique introduced by Nesterov [39] since this technique guarantees that the gradient of 
the smoothed function is Lipschitz continuous. A smoothed approximation to the £1 function g{x) := /o||a;||i 
with smoothness parameter cr > is 

(4.2) ga{x) := max{(a;,z) - ^\\z\\l : \\z\\oo < p}- 



It is easy to show that the optimal solution Zc{x) of (4.2) is 

(4.3) Zc{x) = min{p, max{a;/cr, — p}}. 

According to Theorem 1 in [39], the gradient of g^ is given by Vga-{x) — z„{x) and is Lipschitz continuous 
with Lipschitz constant L(g^) — 1/a. After smoothing g, we can apply Algorithms 3 and 6 to solve the 
smoothed problem: 

(4.4) minf{x)+g„{x). 

We have the following theorem about the e-optimal solutions of problems (4.1) and (4.4). 

Theorem 4.1. Let a ~ and e > 0. If x{a) is an e/2-optimal solution to (4.4), then x{a) is an 
e-optimal solution to (4.1). 

Proof. Let Dg :~ max{i||2:||2 : H-zHoo < p} = ^^P^ ^i^d x* and x*{a) be optimal solution to problems 
(4.1) and (4.4), respectively. Note that 

(4.5) g,{x) < g{x) < g,{x)+aDgyx G R". 



Using the inequalities in (4.5) and the facts that x{(j) is an e/2-optimal solution to (4.4) and crDg = |, we 
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have 



f{x{a)) + g{x{a)) - f{x*) - g[x*) < f{x{a)) + ga{x{a)) + aDg - f{x*) - g,[x*) 

< f{x{<j))+g„{x{<j))+aDg- f{x*{a))~g„{x*{cj)) 

< e/2 + (iDg = e. □ 

Thus, to find an e-optimal solution to (4.1), we can apply Algorithms 3 and 6 to find an e/2-optimal 
solution to (4.4) with a = ^j^. The iteration complexity results in Corollaries 2.4 and 3.5 hold since the 
gradient of g^ is Lipschitz continuous. However, the numbers of iterations needed by ALM and FALM to 
obtain an e-optimal solution to (4.1) become 0(l/e^) and 0(l/e), respectively, due to the fact that the 

2 

Lipschitz constant L^g^) ~ l/a ~ ^ = 0(l/e). 

When Algorithms 4 and 7 are applied to solve (4.1), the subproblems (1.2) and (1.3) are easy to solve. 
Specifically, (1.2) corresponds to solving a linear system which is particularly easy to do because of the special 
structures of R and W (see [1, 2]); (1.3) corresponds to a vector shrinkage operation. When Algorithms 3 
and 6 are applied to solve (4.4), (1.3) with g replaced by g^ is also easy to solve; its optimal solution is 

X := z — r min{p, max{— p, }}. 

T + cr 

Since ALM is equivalent to SADAL when both functions are smooth, we implemented ALM as SADAL 
when we solved (4.4). We also applied SADAL to the nonsmooth problem (4.1). We also implemented ALM- 
S as Algorithm 5 since the latter was usually faster. In all algorithms, we set the initial points x^ = iP = 0, 
and in FALM and FALM-S we set z^ = 0. MATLAB codes for SALSA, FISTA and ISTA (modified from 
FISTA) were downloaded from http://easeais.lx.it.pt/~mafonso/salsa.html and their default inputs were 
used. Moreover, A° was set to in algorithms 2, 4, 5 and 7 since Vgcr(x'^) — and G —dg{x'^) when 
= 0. Also, whenever g{x) was smoothed, we set a = 10~^. ^ was set to 1 in all the algorithms since the 
Lipschitz constant of the gradient of function i||i?W^(-) — b\\2 was known to be 1. We set /x to 1 even for 
the smoothed problems. Although this violates the requirement fi < jj^-^ in Corollaries 2.4 and 3.5, we see 
from our numerical results reported below that ALM and FALM still work very well. All of the algorithms 
tested were terminated after 1000 iterations. The (nonsmoothed) objective function values in (4.1) produced 
by these algorithms at iterations: 10, 50, 100, 200, 500, 800 and 1000 for different choices of p are presented 
in Tables 4.1 and 4.2. The CPU times (in seconds) and the number of iterations required to reduce the 
objective function value to below 1.04e + 5 and 8.60e + 5 arc reported respectively in the last columns of 
Tables 4.1 and 4.2. 

All of our codes were written in MATLAB and run in MATLAB 7.3.0 on a Dell Precision 670 workstation 
with an Intel Xeon(TM) 3.4GHZ CPU and 6GB of RAM. 

Table 4.1 

Comparison of the algorithms for solving (4.1) with p = 0.01 



solver 


obj in k-th iteration 


cpu (iter) 




10 


.50 


100 


200 


500 


800 


1000 




FALM-S 


1.767239e+5 


1.040919e+5 


1.004322C+5 


9.7265990+4 


9.3412820+4 


9.182962e+4 


9.121742e+4 


24.3 (51) 


FALM 


1.767249e+5 


1.040955e+5 


9.8998430+4 


9.516208O+4 


9.1863550+4 


9.073086e+4 


9.028790e+4 


23.1 (51) 


FISTA 


1.723109e-(-5 


1.061116e+5 


1.016385O+5 


9.7528580+4 


9.372093O+4 


9.233719e+4 


9.178455e+4 


26.0 (69) 


ALM-S 


4.218082e-(-5 


1.439742e-(-5 


1.2128650+5 


1.107103O+5 


1.042869O+5 


1.021905e+5 


1.013128e+5 


208.9 (531) 


ALM 


4.585705e-(-5 


1.481379e+5 


1.2331820+5 


1.1166830+5 


1.047410O+5 


1.025611e+5 


1.016589e+5 


208.1 (581) 


ISTA 


2.345290e-|-5 


1.267048e+5 


1.1378270+5 


1.079721O+5 


1.040666O+5 


1.025107e+5 


1.018068e+5 


196.8 (510) 


SALSA 


8.772957e-|-5 


1.549462e4-5 


1.2673790+5 


1.1326760+5 


1.054600O+5 


1.031346e+5 


1.021898e+5 


223.9 (663) 


SADAL 


2.524912e-|-5 


1.271591e+5 


1.1335420+5 


1.068386O+5 


1.021905O+5 


1.004005e+5 


9.961905e+4 


113.5 (332) 
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Table 4.2 

Comparison of the algorithms for solving (4.1) with p = 0.1 



solver 


obj in /c-th iteration 


cpu (iter) 




10 


50 


100 


200 


500 


800 


1000 




FALM-S 


9.868574e+5 


8.771604e+5 


8.487372e+5 


8.271496e+5 


8.110211e+5 


8.065750e+5 


8.050973e+5 


37.7 (76) 


FALM 


9.876315e+5 


8.629257e+5 


8.369244e+5 


8.210375e+5 


8.097621e+5 


8.067903e+5 


8.058290e+5 


25.7 (54) 


FISTA 


9.924884e+5 


8.830263e+5 


8.501727e+5 


8.288459e+5 


8.126598e+5 


8.081259e+5 


8.066060e+5 


30.1 (79) 


ALM-S 


1.227588e+6 


9.468694e+5 


9.134766e+5 


8.880703e+5 


8.617264e+5 


8.509737e+5 


8.465260e+5 


214.5 (537) 


ALM 


1.263787e+6 


9.521381e+5 


9.172737e+5 


8.910902e+5 


8.639917e+5 


8.528932e+5 


8.482666e+5 


211.8 (588) 


ISTA 


1.048956e+6 


9.396822e+5 


9.161787e+5 


8.951970e+5 


8.700864e+5 


8.589587e+5 


8.541664e+5 


293.8 (764) 


SALSA 


1.680608e+6 


9.601661e+5 


9.230268e+5 


8.956607e+5 


8.674579e+5 


8.558580e+5 


8.509770e+5 


230.2 (671) 


SADAL 


1.060130e+6 


9.231803e+5 


8.956150e+5 


8.735746e+5 


8.509601e+5 


8.420295e+5 


8.383270e+5 


112.5 (335) 



From Tables 4.1 and 4.2 we see that in terms of the value of the objective function achieved after a 
specified number of iterations, the performance of FALM-S and FALM is always slightly better than that of 
FISTA and is much better than the performance of the other algorithms. On the two test problems, since 
FALM-S and FALM are always better than ALM-S and ALM, and FISTA is always better than ISTA, we 
can conclude that the Nesterov-type acceleration technique greatly speeds up the basic algorithms on these 
problems. Moreover, although sometimes in the early iterations FISTA (ISTA) is better than FALM-S and 
FALM (ALM-S and ALM), it is always worse than the latter two algorithms when the iteration number is 
large. We also illustrate our comparisons graphically by plotting in Figure 4.1 the objective function value 
versus the number of iterations taken by these algorithms for solving (4.1) with p = 0.1. From Figure 4.1 we 
see clearly that for this problem, ALM outperforms ISTA, FALM outperforms FISTA, FALM outperforms 
ALM and FALM-S outperforms ALM-S. 

From the CPU times and the iteration numbers in the last columns of Tables 4.1 and 4.2 we see that, 
the fast versions arc always much better than the basic versions of the algorithms. Since iterations of FISTA 
cost less than those of FALM-S (and FALM as weU), we see that although FISTA takes 35% (4%) more 
iterations than FALM-S in the last column of Table 4.1 (4.2) it takes only 7% more time (20% less time). 

We note that for the problems with p = 0.01 and p = 0.1, (i.e., for the results given in Tables 4.1 and 
4.2), 891 and 981, respectively, of the first 1000 iterations performed by FALM-S were skipping steps. In 
contrast, none of the steps performed by ALM-S on either of these problems were skipping steps. While the 
latter result is somewhat surprising, the fact that FALM-S performs many skipping steps is not, since the 
Nesterov-like acceleration approach is an over- relaxation approach that generates points that extrapolate 
beyond the previous point and the one produced by the ALM algorithm. 

5. Applications. In this section, we describe how ALM and FALM can be applied to problems that 
can be formulated as RPCA problems to illustrate the use of Nesterov-type smoothing when the functions / 
and g do not satisfy the smoothness conditions required by the theorems in Sections 2 and 3. Our numerical 
results show that our methods are able to solve huge problems that arise in practice; e.g., one problem 
involving roughly 40 million variables and 20 million linear constraints is solved in about three-quarters of 
an hour. We alse describe application of our methods to the SICS problem. 

5.1. Applications in Robust Principal Component Analysis. In order to apply Algorithms 3 
and 6 to (1.6), we need to smooth both the nuclear norm f{X) := \\X\\^, and the li norm g{Y) 
We again apply Nesterov's smoothing technique as in Section 4. g{Y) can be smoothed in the same way as 
the vector £i norm in Section 4. We use gaiX) to denote the smoothed function with smoothness parameter 
(7 > 0. A smoothed approximation to f{X) with smoothness parameter cr > is 



(5.1) 



f.{X) max{(X, W) - -\\W\\l : \\W\\ < 1}. 
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Fig. 4.1. comparison of the algorithms 

It is easy to show that the optimal solution Wa-{X) of (5.1) is 

(5.2) W^{X) = J7Diag(min{7, 1})F^, 

where J7Diag(7)T^^ is the singular value decomposition (SVD) of X/a. According to Theorem 1 in [39], the 
gradient of /ct is given by V/cr(^) = Wcr{X) and is Lipschitz continuous with Lipschitz constant i(/o-) = l/cr. 
After smoothing / and g, we can apply Algorithms 3 and 6 to solve the following smoothed problem: 

(5.3) mm{UiX)+g,iY):X + Y = M}. 

We have the following theorem about e-optimal solutions of problems (1.6) and (5.3). 



Theorem 5.1. Let a 



and e > 0. // {X{a),Y{a)) is an e/2-optimal solution 



2 max{ inin{7n, n} ,mn 

(5.3), then {X{a),Y(a)) is an e-optimal solution to (1.6). 

Proof. Let Df := max{i||M^|||, : < 1} = imin{m,7i}, Dg := max{i|lZ|||, : ||Z||oo < p} 
and {X*,Y*) and {X* {a),Y* {a}) be optimal solution to problems (1.6) and (5.3), respectively. Note that 



to 



(5.4) 



MX) < f{X) < MX) + aDf,yx e 
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and 



(5.5) g„{Y) < g{Y) < g„{Y)+aDg,yY e M"><". 

Using the inequalities in (5.4) and (5.5) and the facts that {X{a),Y{a)) is an e/2-optimal solution to (5.3) 
and amsx.{Df, Dg} = |, we have 

fiX{a)) + g{Y{a)) - f{X*) - g{Y*) < UX{a)) + ,9.(r(a)) + aDf + aD^ - ,f„{X*) - g„{Y*) 

< U{X{ct)) + gAYi<j)) + aDf + aDg - fAX*{a)) - ga{Y*{a)) 
<e/2 + aDf + aDg<e/2 + e/A + e/A = e. □ 

Thus, according to Theorem 5.1, to find an e-optimal solution to (1.6), we need to find an e/2-optinial 
solution to (5.3) with a = ■, — ^-r — i it- We can either apply Algorithms 3 and 6 to solve (5.3), 

^ ' 2 max|min|m, n},mnp^| i: r- ^ o \ /i 

or apply Algorithms 4 and 7 to solve (5.3) with only one functions (say f{x)) smoothed. The iteration 
complexity results in Theorems 2.3 and 3.4 hold since the gradients of fa is Lipschitz continuous. However, 
the numbers of iterations needed by ALM and FALM to obtain an e-optimal solution to (1.6) become 0(l/e^) 
and 0(l/e), respectively, due to the fact that the Lipschitz constant L{J„) = l/cr = } _ 

0(l/e). 

The two subproblems at iteration k of Algorithm 3 when applied to (5.3) reduce to 

(5.6) := argmin/^(X) + g^iY^) + (V.9<,(y'^'), M - X -Y^) + ^\\X + Y'' - M"^ 



and 



(5.7) r'^+i := argmin/,(X'=+i) + (V/<,(X'=+i), M - X^+i - Y) + +Y - AI\\l + g^iY). 
The first-order optimality conditions for (5.6) are: 

(5.8) Wa{X)~ZaiY'') + -iX + Y''-M) = 0, 
where WdX) and Z„{Y) are defined in (5.2) and (4.3). It is easy to check that 

(5.9) X C/Diag(7 ^ . )V^ 

satisfies (5.8), where t/Diag(7)l/^ is the SVD of the matrix ^Za{Y^)~Y^ +M . Thus, solving the subproblcm 
(5.6) corresponds to an SVD. If we define B := /iWo-(A''^+^) — X^'^^ + M, it is easy to verify that 

]3 ■ ■ 

(5.10) Yij — Bij — /j,min{p, max{— p, — -^}} for i = 1, . . . , m and j = 1, . . . , n 

(T + /i 

satisfies the first-order optimality conditions for (5.7): -WJX^+^) + i(X'=+i + Y - M) ^ Za{Y) = 0. 
Thus, solving the subproblem (5.7) can be done very cheaply. The two subproblems at the k-ih. iteration of 
Algorithm 6 can be done in the same way and the main computational effort in each iteration of both ALM 
and FALM corresponds to an SVD. 

21 



5.2. RPCA with Missing Data. In sonic applications of RPCA, some of the entries of M in (1.6) 
may be missing (e.g., in low-rank matrix completion problems where the matrix is corrupted by noise). 
Let VI be the index set of the entries of M that are observable and define the projection operator Va as: 
{VQ,{X))ij = Xij, if (i,j) e Vl and {VQ.{X))ij = otherwise. It has been shown under some randomness 
hypotheses that the low rank X and sparse Y can be recovered with high probability by solving (see Theorem 
1.2 in [6]), 

(5.11) {XX) :=argmin{||Xi|,+p||y||i : Vn{X + Y) = Vn{M)} . 

To solve (5.11) by ALM or FALM, we need to transform it into the form of (1.6). For this we have 
Theorem 5.2. {X,'Pn{y)) is an optimal solution to (5.11) ij 

(5.12) (A, Y) = argmin{|lAlU + p\\Vn{Y)h : X + Y = Vn{M)}. 

Proof. Suppose {X*,Y*) is an optimal solution to (5.11). We claim that Y*j = 0,V(i, j) ^ il. Otherwise, 
{X* jVniY*)) is feasible to (5.11) and has a strictly smaller objective function value than {X*,Y*)j which 
contradicts the optimality of {X*,Y*). Thus, |l'Pn(i^*)i| i = \\Y*\\i. Now suppose that (X,VniY)) is not 
optimal to (5.11); then we have 

(5.13) \\X*\U+p\\rn{Y*)\\i = ||A*||,+p||y*||i < \\X\\*+p\\Vn{Y)\\i- 
By defining a new matrix Y as 

\ -A*, {^,J)in, 

we have that {X*,Y) is feasible to (5.12) and |l'Po(F)|li = \\'PniY*)\\i. Combining this with (5.13), we 
obtain 

\\X*\U + p\\rn{Y)h < \\X\\* + PWPnXh, 

which contradicts the optimality of {X,Y) to (5.12). Therefore, {X ,Vn{Y)) is optimal to (5.11). □ 

The only differences between (1.6) and (5.12) lie in that the matrix M is replaced by Vn{M) and 
giY) = p\\Y\\i_ is replaced by p\\Vn{Y)\\i. A smoothed approximation ga{Y) to giY) := p\\Vn{Y)\\i is given 

by 

(5.14) ga{Y) max{{rn{Y), Z) - \\Z\\ : ||Z|U < p}, 
and 

(5.15) (V.g^(y)),j = min{p, max{CPo(r))y/a, -p}}, for 1 < ^ < m and 1 < j < n. 



According to Theorem 1 in [39], V5cr(y) is Lipschitz continuous with La[g) = 1/a. Thus the convergence 
and iteration complexity results in Theorems 2.3 and 3.4 apply. The only changes in Algorithms 3 and 4 
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and Algorithms 6 and 7 are: replacing M by Vn{M) and computing y'^+i using (5.10) with B is replaced 



5.3. Numerical Results on RPCA Problems. In this section, we report numerical results obtained 
using the ALM method to solve RPCA problems with both complete and incomplete data matrices M. We 
compare the performance of ALM with the exact ADM (EADM) and the inexact ADM (lADM) methods 
in [31]. The MATLAB codes of EADM and lADM were downloaded from http : / /watt. csl.illinois.edu/ ~ 
perceive / matrix — rank / sample jzode.html and their default settings were used. To further accelerate ALM, 
we adopted the continuation strategy used in EADM and lADM. Specifically, we set /-tfc+i := max{/i, 77/^^}, 
where /io = ||A/||/1.25, = 10^^ and ri = 2/3 in our numerical experiments. Although in some iterations 
this violates the requirement /i < min{-^^, ^ } in Corollaries 2.4 and 3.5, we see from our numerical 
results reported below that ALM and FALM still work very well. We also found that by adopting this 
updating rule for /x, there was not much difference between the performance of ALM and that of FALM. 
So we only compare ALM with EADM and lADM. As in Section 4, since we applied ALM to a smoothed 
problem, we implemented ALM as SADAL. The initial point in ALM was set to (X", F") = (A/, 0) and the 
initial Lagrange multiplier was set to A" = —Vga{Y^). We set the smoothness parameter a ~ 10"^. Solving 
subproblem (5.6) requires computing an SVD (see (5.9)). However, we do not have to compute the whole 
SVD, as only the singular values that are larger than the threshold r ~ max{7?;^+g-} ^'^"-^ corresponding 
singular vectors are needed. We therefore use PRO PACK [29], which is also used in EADM and lADM, to 
compute these singular values and corresponding singular vectors. To use PROPACK, one has to specify 
the number of leading singular values (denoted by svk) to be computed at iteration k. We here adopt the 
strategy suggested in [31] for EADM and lADM. This strategy starts with swo = 100 and updates svk via: 



where d = min{TO, 71} and svp^ is the number of singular values that are larger than the threshold r. 

In all our experiments p was chosen equal to l/^m. We stopped ALM, EADM and lADM when the 
relative infeasibility was less than 10"^ i.e., \\X + Y - M\\f < 10-^||M||f. 

5.3.1. Background Extraction from Surveillance Video. Extracting the almost still background 
from a sequence frames of video is a basic task in video surveillance. This problem is difficult due to the 
presence of moving foregrounds in the video. Interestingly, as shown in [6], this problem can be formulated 
as a RPCA problem (1.6). By stacking the columns of each frame into a long vector, we get a matrix M 
whose columns correspond to the sequence of frames of the video. This matrix M can be decomposed into 
the sum of two matrices M := X + Y. The matrix X, which represents the background in the frames, should 
be of low rank due to the correlation between frames. The matrix Y , which represents the moving objects 
in the foreground in the frames, should be sparse since these objects usually occupy a small portion of each 
frame. We apply ALM to solve (1.6) for two videos introduced in [30]. 

Our first example is a sequence of 200 grayscale frames of size 144 x 176 from a video of a hall at 
an airport. Thus the matrix M is in ]j25344x200^ rpj^g second example is a sequence of 320 color frames 
from a video taken at a campus. Since the video is colored, each frame is an image stored in the RGB 
format, which is a 128 x 160 x 3 cube. The video is then reshaped into a 128 x 160 by 3 x 320 matrix, 
i.e., M £ j^20480x960^ Some frames of the videos and the recovered backgrounds and foregrounds are shown 
in Figure 5.3.1. We only show the frames produced by ALM, because EADM and lADM produce visually 
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(a) (b) (c) (a) (h) (c) 




Fig. 5.1. In the first 3 columns: (a) Video sequence, (b) Static background recovered by our ALM. Note that the man who 
kept still in the 200 frames stays as in the background, (c) Moving foreground recovered by our ALM. In the last 3 columns: 
(a) Video sequence, (b) Static background recovered by our ALM. (c) Moving foreground recovered by our ALM. 

identical results. From these figures we can see that ALM can effectively separate the nearly still background 
from the moving foreground. Table 5.1 summarizes the numerical results on these problems. The CPU times 
are reported in the form of hh : mm : ss. From Table 5.1 wc see that although ALM is slightly worse than 
lADM, it is much faster than EADM in terms of both the number of SVDs and CPU times. Wc note that 
the numerical results in [6] show that the model (1.6) produces much better results than other competing 
models for background extraction in surveillance video. 



Table 5.1 

Comparison of ALM and EADM on surveillance video problems 





Exact ADM 


Inexact ADM 


ALM 


Problem m n 


SVDs CPU 


SVDs CPU 


SVDs CPU 


Hall (gray) 25344 200 


550 40:15 


38 03:47 


43 04:03 


Campus (color) 20480 960 


651 13:54:38 


40 43:35 


46 46:49 



5.3.2. Random Matrix Completion Problems with Grossly Corrupted Data. For the matrix 
completion problem (5.11), we set M := A + E, where the rank r matrix A S R"^" was created as the 
product AlAJi^, of random matrices Al € R"^'' and Ar £ R"^*" with i.i.d. Gaussian entries Af{0, 1) and 
the sparse matrix E was generated by choosing its support uniformly at random and its nonzero entries 
uniformly i.i.d. in the interval [—500,500]. In Table 5.2, rr :— rank(^)/n, spr :— \\E\\o/n'^, the relative 
errors relX := \\X — A\\f/\\A\\f and relY := ||y — i?||i?/||£'||i?, and the samphng ratio of fl, SR ~ m/m?. 
The m indices in 17 were generated uniformly at random. We set p = l/v^ ^-nd stopped ALM when 
the relative infeasibility \\X + Y ~ Vai^)]] p / WVniM)]] p < 10~^ and for our continuation strategy, wc set 
/io = ||7'o(^-'^)1|f/1-25. The test results obtained using ALM to solve (5.12) with the nonsmooth functions 
replaced by their smoothed approximations are given in Table 5.2. From Table 5.2 we see that ALM recovered 
the test matrices from a limited number of observations. Note that a fairly high number of samples was 
needed to obtain small relative errors due to the presence of noise. The number of iterations needed was 
almost constant (around 36), no matter the size of the problems. The CPU times (in seconds) needed are 
also reported. 
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Table 5.2 

Numerical results for noisy matrix completion problems 



rr spr 


iter rclX rclY cpu 


iter rclX rclY cpu 




SR = 90%, n = 500 


SR = 80%, n = 500 


0.05 0.05 


36 4.60e-5 4.25e-6 137 


36 3.24e-5 4.31e-6 153 


0.05 0.1 


36 4.68e-5 5.29e-6 156 


36 4.40e-5 4.91e-6 161 


0.1 0.05 


36 4.04e-5 3.74e-6 128 


36 1.28C-3 1.33e-4 129 


0.1 0.1 


36 6.00e-4 4.50c-5 129 


35 1.06e-2 7.59e-4 124 




SR = 90%,n = 1000 


SR = 80%, n = 1000 


0.05 0.05 


37 3.10e-5 3.96c-6 1089 


37 2.27e-5 4.14e-6 1191 


0.05 0.1 


37 3.20e-5 4.93e-6 1213 


37 3.00e-5 4.66e-6 1271 


0.1 0.05 


37 2.68e-5 3.34e-6 982 


37 1.75e-4 2.49e-5 994 


0.1 0.1 


37 3.64e-5 4.51e-6 1004 


36 4.62e-3 4.63e-4 965 



5.4. Sparse Inverse Covariance Selection. In [43] ALM method was successfully applied to the 
Sparse Inverse Covariance Selection problem: 

(5.16) min F{X) ^ f{X) + g{X), 

where f{X) = -logdet(X) + {S,X) and g{X) = p\\X\\i. 

Note that in our case f{X) does not have Lipschitz continuous gradient in general. Moreover, f{X) is 
only defined for positive definite matrices while g{X) is defined everywhere. These properties of the objective 
function make the SICS problem especially challenging for optimization methods. Nevertheless, we can still 
apply Algorithm 4 and obtain the complexity bound in Theorem 2.3 as follows. As proved in [33], the 
optimal solution X* of (5.16) satisfies X ^ al, where a = jg^^fp;^, (see Proposition 3.1 in [33]). Therefore, 
the SICS problem (5.16) can be formulated as: 

(5.17) min{/(A:) + g{Y) : X - Y ^ 0, X e C,Y e C}, 

where C := {X e 5" : X ^ f /}. We can apply Algorithm 4 and Theorem 2.3 as per Remark 2.7. The 
difficulty arises, however, when performing minimization in Y (Step 5 of Algorithm 4) with the constraint 
Y G C. Without this constraint, the minimization is obtained by a matrix shrinkage operation. However, the 
problem becomes harder to solve with this additional constraint. Minimization in X (Step 3 of Algorithm 
4) with or without the constraint X E C is accomplished by performing an SVD of the current iterate Y'' . 
Hence the constraint can be easily imposed. Also note that once the SVD is computed both V/(X'^+^) and 
V/(y'^) are readily available (see [43] for details). This implies that either skipping or nonskipping iterations 
of Algorithm 4 can be performed at the same cost as one ISTA iteration. 

Instead of imposing constraint Y C in Step 5 of Algorithm 4 we can obtain feasible solutions by a 
line search on /i. We know that the constraint X ^ is not tight at the solution. Hence if we start the 
algorithm with X }z ctl and restrict the step size fi to be sufficiently small then the iterates of the method will 
remain in C. Similarly, one can apply ISTA with small steps to remain in C. Note however, that the bound 
on the Lipschitz constant of the gradient of f{X) is l/a^ and hence can be very large. It is not practical 
to restrict in the algorithm to be smaller than a^, since ^ determines the step size at each iteration. The 
advantage of ALM methods over ISTA in this case is that as soon as the Y G C is relaxed ISTA can no longer 
be applied, while ALM/SADAL can be applied and indeed works very well. The theory in this case only 
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applies once certain proximity to the optimal solution has been reached. But as shown in [43], the SADAL 
method is computationally superior to other state-of-the-art methods for SICS. 

We have also applied the FALM method to the SICS problem, but we have not observed any advantage 
over ALM for this particular application. 

6. Conclusion. In this paper, we proposed both basic and accelerated versions of alternating lineariza- 
tion methods for minimizing the sum of two convex functions. Our basic methods require at most 0(l/e) 
iterations to obtain an e-optimal solution, while our accelerated methods require at most 0{1/^) iterations 
with only a small additional amount of computational effort at each iteration. Numerical results on image 
deblurring, background extraction from surveillance video and matrix completion with grossly corrupted 
data are reported. These results demonstrate the efficiency and the practical potential of our algorithms. 

Acknowledgement. We would like to thank Dr. Zaiwen Wen for insightful discussions on the topic of 
this paper. 
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